Abstract

This paper presents various improvements that were applied to OCamlJit2, 
a Just-In-Time compiler for the OCaml byte-code virtual machine. OCamlJit2 
currently runs on various Unix-like systems with x86 or x86-64 processors. The 
improvements, including the new x86 port, are described in detail, and performance 
measures are given, including a direct comparison of OCamlJit2 to OCamlJit. 

1 Introduction 

The OCaml system [11, 16] is the main implementation of the Caml language [6], featuring
a powerful module system combined with a full-fledged object-oriented layer. It
comes with an optimizing native code compiler ocamlopt, a byte-code compiler ocamlc
with an associated runtime ocamlrun, and an interactive top-level ocaml. OCamlJit [18]
and OCamlJit2 [15] provide Just-In-Time compilers for the byte-code used by ocamlrun
and ocaml.

We describe a set of improvements that were applied to OCamlJit2, including a new
port to the x86 architecture¹ and some interesting optimizations to further improve the
execution speed of the JIT engine. OCamlJit2 is open-source² and is verified to work
with Linux/amd64, Linux/i386 and Mac OS X (64-bit Intel Macs only), but should also
run on other Unix-like systems with x86 or x86-64 processors.

The paper is organized as follows: Section 2 mentions existing systems and relevant
sources of information. Section 3 describes the improvements applied to OCamlJit2, in
particular the new x86 port and various floating-point optimizations. Detailed performance
measures are given in section 4, including a direct comparison with OCamlJit. Section 5
concludes with possible directions for further work.

¹ Also known as i386 or IA-32 architecture, but we prefer the vendor-neutral term.
² The full source code is available from https://github.com/bmeurer/camljit2/ under the terms of
the Q Public License and the GNU Library General Public License.



2 Existing systems 



The implementation of OCamlJit2 described below is based on a previous prototype 
of OCamlJit2 [15], and earlier work on OCamlJit [H] and OCaml [IDl HB H2J □!]. 
We assume that the reader is familiar with the internals of the OCaml byte-code and 
runtime, and the structure of the OCamlJit and OCamlJit2 Just-In-Time compilers. 
An overview of the relevant parts of the OCaml virtual machine is given in both [15] and
[18], while [16] provides the necessary insights on the OCaml language.

3 Improvements 

This section describes the improvements that were applied to the OCamlJit2 prototype
described in [15]. This includes the new x86 port (section 3.1), a simple native
code management strategy (section 3.2), as well as improvements in the implementation of
floating-point primitives (section 3.3), which reduced the execution time of floating-point
benchmarks by up to 30%. Readers only interested in the performance results may skip
straight to section 4.

3.1 32-bit support 

OCamlJit2 now supports both x86-64 processors (operating in long mode) and x86
processors (operating in protected mode). This section provides a brief overview of the
mapping of the OCaml virtual machine to the x86 architecture, especially the mapping of 
the virtual machine registers to the available physical registers. See [15] for a description 
of the implementation details of the x86-64 port. 

The x86 architecture [H] provides 8 general purpose 32-bit registers %eax, %ebx, %ecx,
%edx, %ebp, %esp, %edi and %esi, as well as 8 80-bit floating-point registers organized
into a so-called FPU stack (and also used as 64-bit MMX registers). Recent incarnations
also include a set of 8 SSE2 registers %xmm0, ..., %xmm7. The System V ABI for the x86
architecture [17], implemented by almost all operating systems running on x86 processors,
except Win32 which uses a different ABI, mandates that registers %ebp, %ebx, %edi, %esi
and %esp belong to the caller and the callee is required to preserve their values across
function calls. The remaining registers belong to the callee.

To share as much code as possible with the x86-64 port, we use a similar register 
assignment for the x86 architecture. This includes making good use of callee-save registers 
to avoid saving and restoring too many aspects of the machine state whenever a C function 
is invoked. Our register assignment therefore looks as follows: 

The virtual register accu is mapped to %eax. extra_args goes into %ebx, which -
just like on x86-64 - contains the number of extra arguments as a tagged integer. The
environment pointer env goes into %ebp, and the stack pointer sp goes into %esi. %edi
contains the cached value of the minor heap allocation pointer caml_young_ptr to speed
up allocations of blocks in the minor heap; this is different from ocamlopt on x86, where
caml_young_ptr is not cached in a register (unlike x86-64 where ocamlopt caches its value
in %r15).
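
The tagged integer representation mentioned for extra_args follows the usual OCaml convention, in which an integer n is stored as the machine word 2n + 1. The following is a minimal sketch in Python; the helper names val_long and long_val are chosen here to mirror the runtime's Val_long and Long_val macros and are illustrative only, not part of OCamlJit2.

```python
def val_long(n):
    """Encode an integer the way OCaml tags it: shift left, set the low bit."""
    return (n << 1) + 1


def long_val(v):
    """Decode a tagged integer with an arithmetic right shift."""
    return v >> 1


# Three extra arguments would be held in the register as the word 7.
tagged = val_long(3)
untagged = long_val(tagged)
```

Because the low bit is always set, a tagged integer can never be mistaken for a word-aligned heap pointer, which is what lets the garbage collector tell the two apart.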

The setup and helper routines that form the runtime of the JIT engine are located
in the file byterun/jit_rt_i386.S. These are mostly copies of their x86-64 counterparts
in the file byterun/jit_rt_amd64.S, adapted to the x86 architecture. The adaptation was
mostly straightforward, replacing the x86-64 registers with their x86 counterparts and the
64-bit opcodes with their 32-bit counterparts.

3.1.1 Address mapping and on-demand compilation 

We use the existing scheme [15] to map byte-code addresses to native machine code addresses,
which replaces the instruction opcode word within the byte-code sequence with the offset of the
generated native code relative to the base address caml_jit_code_end. Whenever jumping
to a byte-code address - i.e. during closure application or return - this offset is read from
the instruction opcode, caml_jit_code_end is added to it, and a jump to the resulting
address is performed. The trampoline code for x86 - shown in Figure 1 - is therefore quite
similar to the trampoline code for x86-64 (%ecx contains the address of the byte-code to
execute and %edx is a temporary register).

movl (%ecx), %edx                  # load the offset stored in the opcode word
addl caml_jit_code_end, %edx       # add the base address
jmpl *%edx                         # jump to the native code

Figure 1: Byte-code trampoline (x86) 
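
The address mapping performed by the trampoline can be illustrated with a small model. This is a sketch under assumptions, not OCamlJit2's actual code: the byte-code is modelled as a Python list of 32-bit words, and the names CODE_END, patch_opcode and resolve, as well as the concrete addresses, are hypothetical. It assumes that the offset is stored as a 32-bit word and that the addition wraps modulo 2^32, as addl does on x86.

```python
MASK32 = 0xFFFFFFFF  # x86 registers are 32 bits wide; addl wraps modulo 2**32

# Hypothetical base address standing in for the value of caml_jit_code_end.
CODE_END = 0x0804_0000


def patch_opcode(bytecode, pc, native_addr):
    """Overwrite the opcode word at index pc with the offset of native_addr
    relative to CODE_END, stored as an unsigned 32-bit word."""
    bytecode[pc] = (native_addr - CODE_END) & MASK32


def resolve(bytecode, pc):
    """Mirror the three trampoline instructions of Figure 1."""
    edx = bytecode[pc]                 # movl (%ecx), %edx
    edx = (edx + CODE_END) & MASK32    # addl caml_jit_code_end, %edx
    return edx                         # jmpl *%edx would jump here


bytecode = [0] * 4
patch_opcode(bytecode, 2, 0x0803_F000)  # native code located below CODE_END
target = resolve(bytecode, 2)
```

Storing an offset rather than an absolute address keeps the patched word independent of where the code area is mapped; only the base address has to be known at jump time.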

Adapting the on-demand compilation driver was also straightforward, due to the
similarities of the x86 and x86-64 architectures. It was merely a matter of adding an x86
byte-code compile trampoline - shown in Figure 2 - which is slightly longer than its
x86-64 counterpart, because the C calling convention [17] requires all parameters to be passed
on the C stack.

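pushl %eax                 # save accu on the C stack
pushl %ecx                 # pass the byte-code address as parameter
call caml_jit_compile
movl %eax, %edx            # address of the generated native code
popl %ecx                  # remove the parameter
popl %eax                  # restore accu
jmpl *%edx                 # continue at the native code address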

Figure 2: Byte-code compile trampoline (x86) 

Whenever the byte-code compile trampoline is invoked, %eax contains the current accu
value, %ecx contains the byte-code address and the remainder of the OCaml virtual
machine state is located in global memory locations and callee-save registers. Therefore
%eax has to be preserved on the C stack and %ecx must be passed as parameter to the
caml_jit_compile function. Upon return %eax contains the address of the generated native
code, which is saved to %edx. Afterwards the stack frame is removed, restoring %eax,
and execution continues at the native code address.
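
The effect of on-demand compilation can be sketched as follows. This is a toy model under stated assumptions rather than the real driver: the class LazyJit and its methods are hypothetical, "native code" is represented by a Python function, and the compile method merely stands in for caml_jit_compile.

```python
class LazyJit:
    """Toy model of on-demand compilation (all names are hypothetical)."""

    def __init__(self, bytecode):
        self.bytecode = list(bytecode)
        self.native = {}         # byte-code index -> "native code" (a function)
        self.compile_count = 0   # how often the compiler was invoked

    def compile(self, pc):
        """Stand-in for caml_jit_compile: translate and remember the result."""
        self.compile_count += 1
        operand = self.bytecode[pc]

        def code(accu):
            # Pretend native code: add the operand to the accumulator.
            return accu + operand

        self.native[pc] = code
        return code

    def jump(self, pc, accu):
        """Stand-in for the trampolines: compile on first use, then reuse."""
        code = self.native.get(pc)
        if code is None:
            # accu is untouched by the compiler call, just as %eax is saved
            # and restored around caml_jit_compile in Figure 2.
            code = self.compile(pc)
        return code(accu)


jit = LazyJit([10, 20, 30])
first = jit.jump(1, accu=1)    # triggers compilation of index 1
second = jit.jump(1, accu=2)   # reuses the code generated before
```

Only byte-code that is actually reached gets compiled, which keeps the start-up cost low for short running programs.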

The remaining porting effort was mostly related to generalizing the code emitter
preprocessor macros to conditionally emit x86 or x86-64 code, and fiddling with some nasty
details of the different addressing modes. The core of the code generator is almost free of
#ifdefs, because the structure of the generated code for the two targets is pretty much
equivalent. This is especially true for floating-point operations, which use SSE2 registers
and instructions on both x86 and x86-64. OCamlJit2 therefore requires a processor with
support for the SSE2 extension [HE]³; if OCamlJit2 detects a processor without SSE2, it
disables the JIT engine and falls back to the byte-code interpreter.

3.2 Native code management 

With our earlier prototype, memory allocated to a chunk of generated native machine
code was never freed by OCamlJit2, even if the byte-code segment, for which this native
machine code was generated, was released (using the Meta.static_release_bytecode
OCaml function). This may cause trouble for MetaOCaml [19] and long-lived interactive
top-level sessions, where the life-time of most byte-code segments is limited. This was
mostly due to the fact that all generated native machine code was stored incrementally
in one large memory region, with no way to figure out which part of the memory region
contains code for a certain byte-code segment. We have addressed this problem with a
solution quite similar to that used in OCamlJit [18].

[Diagram: byte-code segments (segment 1, segment 2) pointing to their chunks in the native code area]

Figure 3: Byte-code segments and native code chunks

Instead of generating native machine code incrementally into one large region of memory,
we divide the region into smaller parts, called native code chunks, and generate code
into these chunks. Every byte-code segment has an associated list of chunks, allocated from
the large memory region, now called the chunk pool, which contain the generated code for the
given byte-code segment, as shown in Figure 3.

³ This is only relevant in case of x86, as all x86-64 processors are required to implement the SSE2
extension.

Every segment starts out with an empty list of chunks. As soon as on-demand compilation
triggers for the segment, a new chunk is allocated from the chunk pool and native
machine code is generated into this chunk until it is filled up; when this happens, the code
generator allocates another chunk for this segment, and emits a jmp instruction from the
previous chunk to the new chunk if necessary. Once a byte-code segment is freed using
Meta.static_release_bytecode, all associated native code chunks are also released. This
way OCamlJit2 no longer leaks native code generated for previously released byte-code
segments.
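
The chunk management scheme can be sketched as a small model. This is an illustration under assumptions, not the OCamlJit2 data structures: the classes ChunkPool and Segment and their methods are hypothetical, chunks are identified by plain indices, and code emission is reduced to counting bytes. The chunk size of 256 KiB is the value quoted in this section.

```python
CHUNK_SIZE = 256 * 1024  # chunk size in bytes (256 KiB)


class ChunkPool:
    """Hands out fixed-size chunks and takes them back on release."""

    def __init__(self, num_chunks):
        self.free = list(range(num_chunks))  # indices of unused chunks

    def allocate(self):
        if not self.free:
            raise MemoryError("chunk pool exhausted")
        return self.free.pop()

    def release(self, chunk):
        self.free.append(chunk)


class Segment:
    """A byte-code segment together with its list of native code chunks."""

    def __init__(self, pool):
        self.pool = pool
        self.chunks = []  # every segment starts out with no chunks
        self.used = 0     # bytes used in the current (last) chunk

    def emit(self, nbytes):
        """Reserve nbytes of native code, opening a new chunk when needed."""
        if nbytes > CHUNK_SIZE:
            raise ValueError("code sequence larger than a chunk")
        if not self.chunks or self.used + nbytes > CHUNK_SIZE:
            # A real code generator would also emit a jmp from the previous
            # chunk to the new one at this point, if necessary.
            self.chunks.append(self.pool.allocate())
            self.used = 0
        self.used += nbytes

    def release(self):
        """Model of releasing the segment: all chunks go back to the pool."""
        for chunk in self.chunks:
            self.pool.release(chunk)
        self.chunks = []
        self.used = 0


pool = ChunkPool(num_chunks=4)
seg = Segment(pool)
seg.emit(200 * 1024)
seg.emit(100 * 1024)  # does not fit into the first chunk, so a second is opened
chunks_in_use = len(seg.chunks)
seg.release()
```

Because every chunk belongs to exactly one segment, releasing a segment needs no search through the code area; the price is the unused tail of each segment's last chunk.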

While this technique is both simple and effective, there are also several drawbacks. The 
speed of both the JIT compiler and the generated code is somewhat dependent on the size 
of the native code chunks. A small chunk size helps to reduce the size of the working set 
and the amount of memory wasted in long-lived interactive top-level sessions, but on the 
other hand decreases the throughput of the JIT compilation driver and leads to somewhat 
less efficient execution for common byte-code programs which use only a single byte-code 
segment. We have settled on a chunk size of 256 KiB for now, which seems to provide a 
good compromise. 

A possible way to reduce the amount of wasted memory with the interactive top-level 
would be to store code generated for small byte-code segments to some special, shared 
code chunks, and manage the life-time of these shared chunks using a reference counting 
scheme. 

3.3 Floating-point optimizations 

OCaml uses a boxed representation for floating-point values. It does this for various
reasons, e.g. to simplify the interface to the garbage collector and the garbage collector
itself. While this is an elegant and portable solution, it decreases the performance of
OCaml programs using floating-point calculations, especially when executed with the
byte-code interpreter.

The optimizing native code compiler ocamlopt applies various optimizations to avoid
generating a boxed floating-point object in the heap for each and every floating-point
value during the execution of the program. The byte-code interpreter ocamlrun however
has to box every floating-point value, which is certainly slower than using the available
floating-point registers, and also causes a non-negligible load on the garbage collector.
Both OCamlJit and OCamlJit2 (as described in [15]) used the same strategy as the
byte-code interpreter, namely allocating a heap object for every floating-point value during
the execution, but both JIT engines applied various peephole optimizations to avoid the
overhead of calling the floating-point C primitives.

We have implemented a new technique for OCamlJit2, which avoids the heap allocation
for temporary floating-point values that appear as result of a byte-code instruction
and are used as argument in the subsequent byte-code instruction. Figure 4 shows an
example taken from the byte-code of the almabench.ml OCaml program.

ccall caml_array_unsafe_get_float, 2
ccall caml_mul_float, 2
ccall caml_add_float, 2
ccall caml_add_float, 2
ccall caml_sqrt_float, 1
push

Figure 4: Subsequent floating-point primitives (almabench.ml) 



Executing this piece of byte-code with the byte-code interpreter allocates five
floating-point objects in the minor heap, one for each ccall. The first four objects will die
immediately once the garbage collector is run, since they are only used as temporary results,
while the result of the caml_sqrt_float call may indeed survive for a longer period of time.
Furthermore, accessing the actual floating-point values of the temporary results requires at
least four additional load instructions. If the temporary results were kept in registers,
there would be no need for the heap allocation and this would also eliminate the additional
load instructions. While the heap allocations can seriously decrease performance, the
additional load instructions are a minor issue, since the store-to-load forwarding techniques
[21 [H] implemented in modern processors will usually eliminate the load from memory. 

We have implemented an optimization in OCamlJit2, which translates various
floating-point primitives using SSE2 instructions and functions from the standard C math
library, in a way that the result of each primitive is not stored in a heap-allocated object,
but is left in the %xmm0 register. Subsequent floating-point primitives then take the value
from the %xmm0 register instead of the memory location pointed to by %rax (or %eax).
This process is repeated for all floating-point primitives in a row. The last floating-point
instruction - the call to caml_sqrt_float in our example - then allocates a heap block for
its result. Our optimization is particularly beneficial for the x86 port, where we were able
to beat the optimizing native code compiler in the almabench.unsafe benchmark, but it
also pays off for the x86-64 port, where we could reduce the execution time of floating-point
benchmarks by up to 30% (compared to the earlier OCamlJit2 prototype).
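
The saving in heap allocations can be made concrete with a small counting model. This is a sketch under assumptions: it treats byte-code as a list of primitive names, assumes that all five primitives of Figure 4 are covered by the optimization, and uses the hypothetical function names boxed_allocations and unboxed_allocations. It counts allocations only and does not model the generated SSE2 code.

```python
# Primitives from Figure 4 that produce a floating-point result.
FLOAT_PRIMS = {
    "caml_array_unsafe_get_float",
    "caml_mul_float",
    "caml_add_float",
    "caml_sqrt_float",
}


def boxed_allocations(instrs):
    """Interpreter strategy: every floating-point primitive boxes its result."""
    return sum(1 for ins in instrs if ins in FLOAT_PRIMS)


def unboxed_allocations(instrs):
    """Optimized strategy: within a run of consecutive floating-point
    primitives, only the result of the last one is boxed."""
    count = 0
    for i, ins in enumerate(instrs):
        if ins not in FLOAT_PRIMS:
            continue
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if nxt not in FLOAT_PRIMS:
            count += 1  # end of the run: this result must live in the heap
    return count


figure4 = [
    "caml_array_unsafe_get_float",
    "caml_mul_float",
    "caml_add_float",
    "caml_add_float",
    "caml_sqrt_float",
    "push",
]
before = boxed_allocations(figure4)
after = unboxed_allocations(figure4)
```

For the sequence of Figure 4 the model yields five allocations for the interpreter strategy and one for the optimized strategy, matching the description above.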

4 Performance 

With the x86 port in place we were finally able to compare the performance of OCamlJit 
[18] and OCamlJit2 running on the same machine. We measured the performance on 
three different systems, one x86 box for comparison with OCamlJit, and two x86-64 
machines with different processors and operating systems to test-drive the recent improve- 
ments with our primary 64-bit targets: 

• A MacBook with an Intel Core 2 Duo "Penryn" 2.4 GHz CPU (3 MiB L2 Cache), 
and 4 GiB RAM, running Mac OS X 10.6.4. The C compiler is gcc-4.2.1 (Apple 
Inc. build 5664). 
• A Fujitsu Siemens Primergy server with two Intel Xeon E5520 2.26GHz CPUs (8
MiB L2 Cache, 4 Cores), and 12 GiB RAM, running CentOS release 5.5 (Final) with
Linux/x86_64 2.6.18-194.17.1.el5. The C compiler is gcc-4.1.2 (Red Hat 4.1.2-48).

• A Fujitsu Siemens Primergy server with an Intel Pentium 4 "Northwood" 2.4 GHz 
CPU (512 KiB L2 Cache), and 768 MiB RAM, running Debian testing as of 2010/11/20 
with Linux/i686 2.6.32-3-686. The C compiler is gcc-4.4.5 (Debian 4.4.5-6). 

The OCaml distribution used for the tests is 3.12.0. The OCamlJit2 version is the
tagged revision ocamljit2-2010-tr2, compiled with a gcc optimization level of -O (we
used -O3 in the previous measurement, but that seems to trigger compilation bugs with
recent gcc versions). For OCamlJit we had to use OCaml 3.08.4, because building it
with 3.12.0 caused mysterious crashes in some test cases. We used the most recent version
of Gnu Lightning p|| available from the Git repository at the time of this writing (commit 
d2239c223ad22a0e9d7a9909c46d2ac4e5bc0e7f). 

The benchmark programs used to measure the performance are the following test
programs from the testsuite/tests folder of the OCaml 3.12.0 distribution:

• almabench is a number-crunching benchmark designed for cross-language
comparisons; almabench.unsafe is the same program compiled with the -unsafe compiler
switch.

• bdd is an implementation of binary decision diagrams, and therefore a good test for 
the symbolic computation performance. 

• boyer is a term manipulation benchmark. 

• fft is an implementation of the Fast Fourier Transform.

• nucleic is another floating-point benchmark. 

• quicksort is an implementation of the well-known QuickSort algorithm on arrays 
and serves as a good test for loops. 

• soli is a simple solitaire solver, well suited for testing the performance of non-trivial, 
short running programs. 

• sorts is a test bench for various sorting algorithms. 

For our tests we measured the total execution time of the benchmarks, given as combined
system and user CPU time, in seconds. Table 1 lists the running times and speedups for
the Intel Core 2 Duo, table 2 for the Intel Xeon, and table 3 for the Intel Pentium 4.
We compare the byte-code interpreter time $t_{byt}$ to the OCamlJit time $t_{jit}$ (if available),
the OCamlJit2 time $t_{jit2}$, and the time $t_{opt}$ taken by the same program compiled with
the optimizing native code compiler ocamlopt. The tables also list the relative speedups

$$\sigma^{jit}_{byt} = \frac{t_{byt}}{t_{jit}}, \quad \sigma^{jit2}_{byt} = \frac{t_{byt}}{t_{jit2}}, \quad \sigma^{opt}_{byt} = \frac{t_{byt}}{t_{opt}}, \quad \sigma^{jit2}_{jit} = \frac{t_{jit}}{t_{jit2}}, \quad \sigma^{opt}_{jit} = \frac{t_{jit}}{t_{opt}}, \quad \text{and} \quad \sigma^{opt}_{jit2} = \frac{t_{jit2}}{t_{opt}},$$

where bigger values are better. The times were collected by executing each benchmark 5 times with
every engine, and using the timings of the fastest run.

                    time (cpu sec)                 speedup
command            t_byt  t_jit  t_jit2  t_opt   σ^jit_byt  σ^jit2_byt   σ^opt_byt  σ^jit2_jit   σ^opt_jit  σ^opt_jit2
almabench          27.61      -    8.58   4.47           -        3.22        6.17           -           -        1.92
almabench.unsafe   27.54      -    6.14    [?]           -        4.49         [?]           -           -        1.41
bdd                 8.46      -    2.00   0.67           -        4.23       12.66           -           -        2.99
boyer               4.33      -    1.66   1.05           -        2.61        4.11           -           -        1.57
fft                 5.69      -    1.98   0.64           -        2.88        8.96           -           -        3.11
nucleic            14.77      -    3.24   0.80           -        4.56       18.53           -           -        4.06
quicksort           6.78      -    1.28   0.23           -        5.31       29.22           -           -        5.50
quicksort.unsafe    4.07      -    0.84   0.19           -        4.86       21.07           -           -        4.34
soli                0.17      -    0.04   0.01           -        4.81       17.30           -           -        3.60
soli.unsafe         0.14      -    0.02   0.01           -        6.85       17.12           -           -        2.50
sorts              19.42      -    7.24   3.71           -        2.68        5.23           -           -        1.95

Table 1: Running time and speedup (Intel Core 2 Duo, Mac OS X 10.6)

Table 2: Running time and speedup (Intel Xeon, CentOS 5.5)

Table 3: Running time and speedup (Intel Pentium 4, Debian testing)

Figure 5: Speedup relative to the byte-code interpreter
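
The speedup columns can be recomputed from the running times. The following sketch uses the almabench row of Table 1 (27.61 s for the byte-code interpreter, 8.58 s for OCamlJit2, 4.47 s for ocamlopt); the function names best_of and speedup are hypothetical, and the list of five timings passed to best_of is made up purely to illustrate the "fastest of 5 runs" rule.

```python
def best_of(timings):
    """Select the timing of the fastest run."""
    return min(timings)


def speedup(t_slow, t_fast):
    """Relative speedup: how many times faster the second engine is."""
    return t_slow / t_fast


# almabench row of Table 1 (Intel Core 2 Duo), times in CPU seconds.
t_byt, t_jit2, t_opt = 27.61, 8.58, 4.47

s_jit2_byt = round(speedup(t_byt, t_jit2), 2)  # OCamlJit2 vs. interpreter
s_opt_jit2 = round(speedup(t_jit2, t_opt), 2)  # ocamlopt vs. OCamlJit2

# Hypothetical timings of five runs; only the fastest one is reported.
reported = best_of([8.71, 8.58, 8.64, 8.90, 8.60])
```

Ratios recomputed from the rounded times can differ from the published speedups in the last digit, presumably because the published values were derived from unrounded timings.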

Figure 5 highlights the speedup of the JIT engines relative to the byte-code interpreter
on the three test systems. The x86-64 port received another nice speedup in the floating-point
benchmarks compared to the earlier prototype due to the improvements described in
section 3.3. On x86, OCamlJit2 is 2.2 to 9.6 times faster than the byte-code interpreter,
and 1.1 to 2.0 times faster than OCamlJit. This is especially
noticeable with short running programs like soli.unsafe, where OCamlJit2 benefits from
the reduced compilation overhead, and in floating-point programs, where OCamlJit2 wins
because of clever floating-point optimizations and its use of the SSE2 extension.

Figure 6 shows the performance of the byte-code interpreter and the JIT engines relative
to the optimizing native code compiler ocamlopt. One rather surprising result was the bad
performance of the generated x86 native code for the almabench.unsafe benchmark, where
OCamlJit2 was able to beat the native code compiler by a factor of 1.2. This may be
related to the use of SSE2 instructions, which are generally faster than x87 instructions, but it also
seems that we have spotted a problem within the x86 port of ocamlopt here (unfortunately,
we were unable to track down the issue). On a related note, we have also spotted some
issues with the x86-64 floating-point code generated by ocamlopt and already submitted
an appropriate patch⁴, which will be available in OCaml 3.12.1, yielding performance
improvements of 6-13% with floating-point programs.

Figure 6: Performance relative to ocamlopt

In general, floating-point benchmarks benefit a lot when used with OCamlJit2, which
was somewhat expected, in particular with the optimizations implemented in the latest
prototype. OCamlJit does a respectable job, but is unable to compete with OCamlJit2
performance-wise. This is probably caused by the better compilation scheme used by
OCamlJit2 and also related to the use of Gnu Lightning within OCamlJit.
Nevertheless, it is this use of Gnu Lightning which makes OCamlJit slightly more portable
than OCamlJit2 (three supported platforms in case of OCamlJit, compared to only
two with OCamlJit2).



⁴ http://caml.inria.fr/mantis/view.php?id=5180
5 Conclusions and further work



Our results show that Just-In-Time compilation of OCaml byte-code can lead to some
significant speedup (at least twice as fast as the byte-code interpreter in all benchmarks).
Starting out with a simple compilation scheme and applying some clever optimizations led
to impressive performance gains. But we have also noticed that our approach is somewhat
limited, which is mostly related to the nature of the OCaml byte-code and the design
choices made for the interpreter runtime. The OCaml byte-code and runtime were
certainly designed with "fast interpretation" in mind [10] and perform quite well in this area,
but this same fact also limits the possibilities for effective Just-In-Time compilation if one
wants to avoid touching too many aspects of the runtime. Using a register machine as
used by Dalvik [5] or Parrot [7] instead of the stack machine for the byte-code virtual
machine would make it easier to apply JIT techniques - it would in fact make Just-In-Time
compilation the natural implementation choice for the virtual machine.

Implementations of other runtimes, like the various JVMs [13] or the Common Language
Runtime [5], show that it is indeed possible to perform efficient JIT compilation with stack
virtual machines, but these runtimes use expensive compilation techniques for instruction
selection, register allocation and instruction scheduling, whose applicability is questionable
within the scope of the OCaml byte-code runtime. For example, in order to effectively
reduce the overhead of closure application and return in OCaml byte-code execution,
one would most likely need to perform interprocedural register allocation, which mandates
the availability of global control and data flow information, both of which are difficult to
collect efficiently. It may indeed be possible to design a Just-In-Time compiler for the
OCaml byte-code, which generates code as fast as the code generated by ocamlopt, using
standard compilation techniques [2], but such a JIT engine comes at a high cost with
respect to maintainability and execution speed of the JIT compiler, and it is questionable
whether it is worth this cost, especially since there is already an optimizing native code
compiler, which limits the possible use cases for the Just-In-Time compiler.

The main application of OCamlJit2 is the interactive top-level and other dynamic 
code generation environments like MetaOCaml [19]. The OCaml repository already 
contains an experimental "native top-level" ocamlnat for this purpose, which uses the 
functionality of ocamlopt to generate efficient native code at runtime and execute it via 
the native code runtime. Improving ocamlnat may provide a better way to gain an efficient 
top-level, and we are currently evaluating what would need to be done in this area. 